Importing Required Libraries¶
import random
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_context('notebook') # Configures plot aesthetics for Jupyter notebooks
sns.set_style('white') # Sets the plot background style to white
import matplotlib.pyplot as plt
%matplotlib inline
# ensures that inline plotting works correctly (newer versions of Jupyter Notebook do not need this)
from scipy.stats import shapiro
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from fasteda import fast_eda
Regression¶
The aim is to use regression to predict a numeric score indicating diabetes progression one year after blood pressure, BMI and blood sugar level are recorded.
Load the diabetes data set (from sklearn)¶
# About the data
print(load_diabetes()['DESCR'])
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
# Load the data from sklearn as two pandas.DataFrame - features (X) and target variable (y)
diabetes_X, diabetes_y = load_diabetes(return_X_y = True, as_frame = True, scaled = False)
#Renaming columns
diabetes_X.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
diabetes_X.head()
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 |
diabetes_y.head()
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64
# Combine both diabetes_X (features) and diabetes_y (target) into one pandas.DataFrame
diabetes = pd.concat([diabetes_X, pd.Series(diabetes_y)], axis=1)
#Renaming the column with target value
diabetes.rename(columns={0: 'target'}, inplace=True)
# Looking into the data
diabetes.sample(5)
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 43 | 54.0 | 1.0 | 24.2 | 74.0 | 204.0 | 109.0 | 82.0 | 2.0 | 4.1744 | 109.0 | 92.0 |
| 156 | 44.0 | 1.0 | 25.4 | 95.0 | 162.0 | 92.6 | 53.0 | 3.0 | 4.4067 | 83.0 | 25.0 |
| 394 | 58.0 | 1.0 | 28.1 | 111.0 | 198.0 | 80.6 | 31.0 | 6.0 | 6.0684 | 93.0 | 273.0 |
| 88 | 34.0 | 2.0 | 22.6 | 75.0 | 166.0 | 91.8 | 60.0 | 3.0 | 4.2627 | 108.0 | 42.0 |
| 361 | 60.0 | 1.0 | 25.7 | 103.0 | 158.0 | 84.6 | 64.0 | 2.0 | 3.8501 | 97.0 | 182.0 |
Add some missing values¶
The original dataset does not contain any missing values, so for the sake of EDA missing values are introduced at random into 3 columns and 5% of the rows.
# Verifying that the data set has no missing values
diabetes.isna().max(axis=0).max()
False
# Initializing the seed to 2024 to make sure that the same random values are drawn each time the code is executed
random.seed(2024)
# Selecting 3 columns at random
missing_cols = random.sample(range(len(diabetes.columns)), 3)
# Selecting 5% of row index at random
missing_rows = random.sample(diabetes.index.tolist(), int(np.round(len(diabetes.index.tolist())/20)))
# Setting missing values to the randomly selected rows and columns
diabetes.iloc[missing_rows, missing_cols] = np.nan
# Having a look at the columns which have been selected at random
print(diabetes.columns[missing_cols])
Index(['tch', 'bmi', 'tc'], dtype='object')
# Now verifying that the data set has missing values
diabetes.isna().max(axis=0).max()
True
Initial Data Preprocessing¶
Note: In a typical workflow, data preprocessing comes after conducting EDA.
One-Hot Encoding¶
In the diabetes dataset sex is encoded as 1 and 2 for female and male. This is not ideal for predictive models, which may assume that the column has some ordering to it. Hence we use one-hot encoding to create a separate column for each category of sex, with binary values in it.
# Initializing OneHotEncoder (ignore unknown categories in dataset, no categories are dropped)
enc1 = OneHotEncoder(handle_unknown='ignore', drop=None)
# One-hot encode 'sex'.
# Double square brackets are used to ensure that the extracted sex data is in DataFrame format which is required by One-hot encoder
# The output from OneHotEncoder is a sparse matrix (stores only non-zero elements to save memory) which is converted to a numpy array
encoded_sex = enc1.fit_transform(diabetes[['sex']]).toarray()
# Convert numpy array to pandas DataFrame with column names corresponding to its sex category
encoded_sex = pd.DataFrame(encoded_sex, columns=['sex' + str(int(x)) for x in enc1.categories_[0]])
# Horizontally concatenate the 'diabetes' and 'encoded_sex' DataFrames
diabetes = pd.concat([diabetes, encoded_sex], axis=1)
# Looking into the modified diabetes DataFrame
diabetes.sample(5)
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | sex1 | sex2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 411 | 50.0 | 1.0 | 31.8 | 82.0 | 136.0 | 69.2 | 55.0 | 2.0 | 4.0775 | 85.0 | 136.0 | 1.0 | 0.0 |
| 78 | 50.0 | 1.0 | 21.0 | 88.0 | 140.0 | 71.8 | 35.0 | 4.0 | 5.1120 | 71.0 | 252.0 | 1.0 | 0.0 |
| 403 | 43.0 | 1.0 | 35.4 | 93.0 | 185.0 | 100.2 | 44.0 | 4.0 | 5.3181 | 101.0 | 275.0 | 1.0 | 0.0 |
| 264 | 58.0 | 2.0 | 29.0 | 85.0 | 156.0 | 109.2 | 36.0 | 4.0 | 3.9890 | 86.0 | 145.0 | 0.0 | 1.0 |
| 5 | 23.0 | 1.0 | 22.6 | 89.0 | 139.0 | 64.8 | 61.0 | 2.0 | 4.1897 | 68.0 | 97.0 | 1.0 | 0.0 |
From the above, sex is indicated through sex, sex1 and sex2, two of which are redundant; hence sex and sex2 can be dropped.
# Drop 'sex' and 'sex2' from diabetes DataFrame
diabetes = diabetes.drop(['sex', 'sex2'], axis=1)
# Rename 'sex1' to 'sex'
diabetes = diabetes.rename(columns={'sex1': 'sex'})
# Reorder renamed 'sex' to the previous 'sex' position
diabetes = diabetes.loc[:, ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu', 'target']]
# Looking into a sample of the modified diabetes DataFrame
diabetes.sample(5)
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | 55.0 | 1.0 | 32.1 | 110.0 | 164.0 | 84.2 | 42.0 | 4.0 | 5.2417 | 90.0 | 281.0 |
| 307 | 67.0 | 0.0 | 23.5 | 96.0 | 207.0 | 138.2 | 42.0 | 5.0 | 4.8978 | 111.0 | 172.0 |
| 282 | 68.0 | 1.0 | 25.9 | 93.0 | 253.0 | 181.2 | 53.0 | 5.0 | 4.5433 | 98.0 | 230.0 |
| 433 | 41.0 | 1.0 | 20.8 | 86.0 | 223.0 | 128.2 | 83.0 | 3.0 | 4.0775 | 89.0 | 72.0 |
| 275 | 47.0 | 0.0 | 25.3 | 98.0 | 173.0 | 105.6 | 44.0 | 4.0 | 4.7622 | 108.0 | 94.0 |
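As an aside, the one-hot step above can also be done in a single call with pandas' own pd.get_dummies. A minimal sketch on toy data (the 1/2 values mirror the original sex coding; the frame itself is illustrative):

```python
import pandas as pd

# Toy frame mirroring the original 1/2 coding of 'sex'
df = pd.DataFrame({'sex': [1, 2, 2, 1]})

# prefix='sex' yields columns 'sex_1' and 'sex_2'; dtype=float matches
# the 0.0/1.0 values produced by OneHotEncoder above
dummies = pd.get_dummies(df['sex'], prefix='sex', dtype=float)
print(dummies.columns.tolist())  # ['sex_1', 'sex_2']
```

pd.get_dummies is convenient for one-off EDA; OneHotEncoder remains preferable inside a modelling pipeline because it can be fitted on the training set and reused on unseen data.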
Make a Train-Test Split¶
The code below randomly assigns 33% of the rows to the test set and the remaining 67% to the training set. The training set is used to train the predictive models, and the test set is the unseen data on which predictions are made.
# Make a Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.iloc[:, :-1],  # Features data (all columns except the last)
    diabetes.iloc[:, -1],   # Target data (last column)
    test_size=0.33,         # 33% for testing
    random_state=2024       # for reproducibility
)
# `X_train` are the feature columns in the training set.
# `X_test` are the feature columns in the test set.
# `y_train` is the target column for the training set.
# `y_test` is the target column for the test set.
Perform EDA¶
A look at the beginning and end of the data set¶
diabetes.head()
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 0.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 | 151.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 | 75.0 |
| 2 | 72.0 | 0.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 | 141.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 | 206.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 | 135.0 |
diabetes.tail()
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 437 | 60.0 | 0.0 | 28.2 | 112.00 | 185.0 | 113.8 | 42.0 | 4.00 | 4.9836 | 93.0 | 178.0 |
| 438 | 47.0 | 0.0 | 24.9 | 75.00 | 225.0 | 166.0 | 42.0 | 5.00 | 4.4427 | 102.0 | 104.0 |
| 439 | 60.0 | 0.0 | 24.9 | 99.67 | 162.0 | 106.6 | 43.0 | 3.77 | 4.1271 | 95.0 | 132.0 |
| 440 | 36.0 | 1.0 | 30.0 | 95.00 | 201.0 | 125.2 | 42.0 | 4.79 | 5.1299 | 85.0 | 220.0 |
| 441 | 36.0 | 1.0 | 19.6 | 71.00 | 250.0 | 133.2 | 97.0 | 3.00 | 4.5951 | 92.0 | 57.0 |
Describe the DataFrame¶
# Having a look at the general statistical summaries for the diabetes DataFrame
diabetes.describe()
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 442.000000 | 442.000000 | 420.000000 | 442.000000 | 420.000000 | 442.000000 | 442.000000 | 420.000000 | 442.000000 | 442.000000 | 442.000000 |
| mean | 48.518100 | 0.531674 | 26.358095 | 94.647014 | 188.830952 | 115.439140 | 49.788462 | 4.071595 | 4.641411 | 91.260181 | 152.133484 |
| std | 13.109028 | 0.499561 | 4.404820 | 13.831283 | 34.690827 | 30.413081 | 12.934202 | 1.296942 | 0.522391 | 11.496335 | 77.093005 |
| min | 19.000000 | 0.000000 | 18.000000 | 62.000000 | 97.000000 | 41.600000 | 22.000000 | 2.000000 | 3.258100 | 58.000000 | 25.000000 |
| 25% | 38.250000 | 0.000000 | 23.175000 | 84.000000 | 164.000000 | 96.050000 | 40.250000 | 3.000000 | 4.276700 | 83.250000 | 87.000000 |
| 50% | 50.000000 | 1.000000 | 25.700000 | 93.000000 | 186.000000 | 113.000000 | 48.000000 | 4.000000 | 4.620050 | 91.000000 | 140.500000 |
| 75% | 59.000000 | 1.000000 | 29.325000 | 105.000000 | 209.000000 | 134.500000 | 57.750000 | 5.000000 | 4.997200 | 98.000000 | 211.500000 |
| max | 79.000000 | 1.000000 | 42.200000 | 133.000000 | 301.000000 | 242.400000 | 99.000000 | 9.090000 | 6.107000 | 124.000000 | 346.000000 |
Missing Values¶
# We know that the dataframe has missing values which can be verified below
diabetes.isna().max(axis=1).max()
True
# To see the summary of missing values in each column
diabetes.isna().sum()
age        0
sex        0
bmi       22
bp         0
tc        22
ldl        0
hdl        0
tch       22
ltg        0
glu        0
target     0
dtype: int64
# Visualizing the missing values in diabetes dataframe
msno.matrix(diabetes)
<Axes: >
It can easily be observed how the missing values occur over the three columns bmi, tc and tch. There are typically three approaches to dealing with missing values:
- Dropping the observation with missing values
- Filling the observations with missing values with the mean
- Filling the observations with missing values with the median
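The sections below evaluate these approaches one at a time; as a compact sketch, they can also be compared in a single helper. This assumes X_train, y_train, X_test, y_test from the split above (RMSE is computed as mean_squared_error ** 0.5 so the sketch does not depend on a recent scikit-learn; the helper name is illustrative):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def rmse_per_strategy(X_train, y_train, X_test, y_test):
    """Return test RMSE for dropping vs. imputing missing training rows."""
    # Score only on test rows that have no missing features
    test_ok = ~X_test.isna().any(axis=1)
    results = {}

    # Approach 1: drop incomplete training rows
    train_ok = ~X_train.isna().any(axis=1)
    reg = LinearRegression().fit(X_train[train_ok], y_train[train_ok])
    pred = reg.predict(X_test[test_ok])
    results['drop'] = mean_squared_error(y_test[test_ok], pred) ** 0.5

    # Approaches 2 and 3: fill missing training values with the mean / median
    for strategy in ('mean', 'median'):
        imp = SimpleImputer(strategy=strategy)
        X_fill = pd.DataFrame(imp.fit_transform(X_train), columns=X_train.columns)
        reg = LinearRegression().fit(X_fill, y_train)
        pred = reg.predict(X_test[test_ok])
        results[strategy] = mean_squared_error(y_test[test_ok], pred) ** 0.5
    return results
```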
Dropping the observations with missing values¶
# Linear regression with dropping NaNs
# Getting the Non-NANs indices (observations/rows) of X_train and X_test
nonnan_train_indices = X_train.index[~X_train.isna().max(axis=1)]
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]
# Fit an instance of Linear Regression with train dataset
reg = LinearRegression().fit(X_train.loc[nonnan_train_indices], y_train.loc[nonnan_train_indices])
# Generate predictions for the test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])
# Finding the root mean squared error between prediction vs test target (y_test)
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
55.962919546725054
Filling the observations with missing values with the mean¶
# Linear regression with mean fill
# Getting the Non-NAN indices (observations/rows) of X_test as only missing values in train dataset will be filled with mean
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]
# Initializing simple imputer with 'mean' strategy.
# Note: SimpleImputer supports mean, median, most_frequent and constant strategies
imp_mean = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# Fit the simple imputer using the training data
imp_mean.fit(X_train)
# Transforming X_train to mean filled dataset and converting it to a pandas DataFrame
X_train_mean_fill = pd.DataFrame(imp_mean.transform(X_train))
# Assigning column names to the above dataframe
X_train_mean_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
# Fit an instance of Linear Regression with mean filled train dataset
reg = LinearRegression().fit(X_train_mean_fill, y_train)
# Generate predictions for the test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])
# Finding the root mean squared error between prediction vs test target (y_test)
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
55.95122410079265
Filling the observations with missing values with the median¶
# Linear regression with median fill
# Getting the Non-NAN indices (observations/rows) of X_test as only missing values in train dataset will be filled with median
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]
# Initializing simple imputer with 'median' strategy.
# Note: SimpleImputer supports mean, median, most_frequent and constant strategies
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')
# Fit the simple imputer using the training data
imp_median.fit(X_train)
# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_fill = pd.DataFrame(imp_median.transform(X_train))
# Assigning column names to the above dataframe
X_train_median_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
# Fit an instance of Linear Regression with median filled train dataset
reg = LinearRegression().fit(X_train_median_fill, y_train)
# Generate predictions for the test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])
# Finding the root mean squared error between prediction vs test target (y_test)
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
55.9148764740674
The root mean squared error is lowest for linear regression with missing values filled with the median. Next we look into ways to improve this.¶
Histograms and Boxplots¶
# Define a function that takes columns_toplt as an argument
def plot_hist_and_box(diabetes, columns_toplt):
    for idx, col in enumerate(columns_toplt):
        # Create two subplots (2 plots in a row)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
        # Histogram in the first subplot (ax1) with KDE overlay
        sns.histplot(diabetes, x=diabetes[col], kde=True,
                     color=sns.color_palette('hls', len(columns_toplt))[idx], ax=ax1)
        # Boxplot in the second subplot (ax2) with the same color as the histogram
        sns.boxplot(diabetes, x=diabetes[col], width=0.4, linewidth=3, fliersize=2.5,
                    color=sns.color_palette('hls', len(columns_toplt))[idx], ax=ax2)
        # Add a title to the figure
        fig.suptitle(f"Histogram and Boxplot of {col}", size=20, y=1.02)
        plt.show()
# Assigning column names of all columns in diabetes dataframe except 'sex'
columns_toplt = [i for i in diabetes.columns if i != 'sex']
# Call the function
plot_hist_and_box(diabetes, columns_toplt)
# looking closely into hdl column
# Assigning column name 'hdl'
columns_toplt = ['hdl']
# Call the function
plot_hist_and_box(diabetes, columns_toplt)
Normality Test on 'hdl'¶
# Normality test on 'hdl'
stat, p = shapiro(X_train['hdl'])
print('Statistics = %.3f, p = %.3f' % (stat, p))
# Interpret
alpha = 0.05
if p > alpha:
    print("Sample is normally distributed (fail to reject null hypothesis)")
else:
    print("Sample is not normally distributed (reject null hypothesis)")
Statistics = 0.962, p = 0.000
Sample is not normally distributed (reject null hypothesis)
Normality Test on log of 'hdl'¶
# Normality test on log of 'hdl'
stat, p = shapiro(np.log(X_train['hdl']))
print('Statistics = %.3f, p = %.3f' % (stat, p))
# Interpret
alpha = 0.05
if p > alpha:
    print("Sample is normally distributed (fail to reject null hypothesis)")
else:
    print("Sample is not normally distributed (reject null hypothesis)")
Statistics = 0.996, p = 0.700
Sample is normally distributed (fail to reject null hypothesis)
Linear Regression with missing observations filled with median and log of 'hdl'¶
# Replacing 'hdl' column in X_train and X_test with log of 'hdl'
X_train['hdl'] = np.log(X_train['hdl'])
X_test['hdl'] = np.log(X_test['hdl'])
# Getting Non-NAN index values for X_test
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]
# Initializing simple imputer with 'median' strategy
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')
# Fit simple imputer with training data (with log 'hdl')
imp_median.fit(X_train)
# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_log_fill = pd.DataFrame(imp_median.transform(X_train))
# Assigning column names to the above dataframe
X_train_median_log_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
# Fit an instance of Linear Regression
reg = LinearRegression().fit(X_train_median_log_fill, y_train)
# Generate prediction for X_test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])
# Calculate Root Mean Squared error
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
55.685863090763554
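As an aside, overwriting 'hdl' in place works, but the log transform can also be expressed as a scikit-learn transformer so that it travels with the model. A sketch on toy data (column names follow this notebook; the two-column frame is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Apply np.log to 'hdl' only; every other column passes through unchanged
log_hdl = ColumnTransformer(
    [('log_hdl', FunctionTransformer(np.log), ['hdl'])],
    remainder='passthrough',
)

df = pd.DataFrame({'hdl': [38.0, 70.0], 'glu': [87.0, 69.0]})
out = log_hdl.fit_transform(df)  # first output column holds log('hdl')
```

A transformer like this can be chained with SimpleImputer and LinearRegression in a sklearn.pipeline.Pipeline, avoiding manual bookkeeping of which columns were transformed.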
Root mean squared error has improved. Looking into column 'ldl' for further improvement¶
# looking closely into ldl column
# Assigning column name 'ldl'
columns_toplt = ['ldl']
# Call the function
plot_hist_and_box(diabetes, columns_toplt)
Linear Regression with missing observations filled with median, log of 'hdl' and removal of outliers in 'ldl'¶
# Removing outlier rows based on 'ldl'
X_train_nonoutlier_idx = X_train.index[X_train.ldl < X_train.ldl.quantile(0.999)]
X_train = X_train.loc[X_train_nonoutlier_idx]
y_train = y_train.loc[X_train_nonoutlier_idx]
# Getting Non-NAN index values for X_test
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]
# Initializing simple imputer with 'median' strategy
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')
# Fit simple imputer with training data (with log 'hdl')
imp_median.fit(X_train)
# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_log_fill = pd.DataFrame(imp_median.transform(X_train))
# Assigning column names to the above dataframe
X_train_median_log_fill.columns= ['age', 'sex', 'bmi', 'bp', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
# Fit an instance of Linear Regression
reg = LinearRegression().fit(X_train_median_log_fill, y_train)
# Generate prediction for X_test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices])
# Calculate Root Mean Squared error
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
55.53368308287885
Correlation Matrix¶
plt.figure(figsize = (12, 8))
sns.heatmap(diabetes.corr(), annot = True, cmap = 'Spectral', linewidth = 2, linecolor = '#000000', fmt = '.3f')
plt.show()
It can be observed that the correlation of 'tc' and 'ldl' to 'target' is very low. Hence we might be able to improve the regression model by dropping the column 'tc'.
Linear Regression with median fill, log of 'hdl', outliers removed in 'ldl' and 'tc' dropped¶
# Removing outlier rows based on 'ldl'
X_train_nonoutlier_idx = X_train.index[X_train.ldl < X_train.ldl.quantile(0.999)]
X_train = X_train.loc[X_train_nonoutlier_idx]
y_train = y_train.loc[X_train_nonoutlier_idx]
# Getting Non-NAN index values for X_test
nonnan_test_indices = X_test.index[~X_test.isna().max(axis=1)]
# Getting column names except 'tc'
col_no_tc = [i for i in X_train.columns if i != 'tc']
# Initializing simple imputer with 'median' strategy
imp_median = SimpleImputer(missing_values = np.nan, strategy = 'median')
# Fit simple imputer with training data (with log 'hdl')
imp_median.fit(X_train.loc[:, col_no_tc])
# Transforming X_train to median filled dataset and converting it to a pandas DataFrame
X_train_median_log_fill = pd.DataFrame(imp_median.transform(X_train.loc[:, col_no_tc]))
# Assigning column names to the above dataframe
X_train_median_log_fill.columns= ['age', 'sex', 'bmi', 'bp', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
# Fit an instance of Linear Regression
reg = LinearRegression().fit(X_train_median_log_fill, y_train)
# Generate prediction for X_test dataset
pred = reg.predict(X_test.loc[nonnan_test_indices, col_no_tc])
# Calculate Root Mean Squared error
root_mean_squared_error(y_test.loc[nonnan_test_indices], pred)
55.619929987460424
Removal of the 'tc' column has led to worse performance.
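A single train/test split can be a noisy basis for decisions like dropping a column; cross-validation on the training data gives a steadier estimate. A sketch (the helper name cv_rmse is illustrative, and it assumes a complete, already-imputed feature matrix):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_rmse(X, y, cols):
    """Mean 5-fold cross-validated RMSE of a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), X[cols], y,
                             scoring='neg_root_mean_squared_error', cv=5)
    return -scores.mean()

# Example decision: keep 'tc' only if dropping it raises the CV error, e.g.
# cols_all = X_train_median_log_fill.columns.tolist()
# cv_rmse(X_train_median_log_fill, y_train, cols_all)
# cv_rmse(X_train_median_log_fill, y_train, [c for c in cols_all if c != 'tc'])
```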
Pair Plots¶
sns.pairplot(diabetes)
plt.show()
A Simple function to perform EDA - fasteda¶
The fast_eda function from the fasteda package performs all of the above EDA in a single step.
import warnings
# Suppress all warnings
warnings.filterwarnings('ignore')
# Now run fast_eda(diabetes) function
fast_eda(diabetes)
DataFrame Head:
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 0.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 | 151.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 | 75.0 |
| 2 | 72.0 | 0.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 | 141.0 |
DataFrame Tail:
| age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 439 | 60.0 | 0.0 | 24.9 | 99.67 | 162.0 | 106.6 | 43.0 | 3.77 | 4.1271 | 95.0 | 132.0 |
| 440 | 36.0 | 1.0 | 30.0 | 95.00 | 201.0 | 125.2 | 42.0 | 4.79 | 5.1299 | 85.0 | 220.0 |
| 441 | 36.0 | 1.0 | 19.6 | 71.00 | 250.0 | 133.2 | 97.0 | 3.00 | 4.5951 | 92.0 | 57.0 |
----------------------------------------------------------------------------------------------------
Missing values:
|   | 0 |
|---|---|
| bmi | 22 |
| tc | 22 |
| tch | 22 |
----------------------------------------------------------------------------------------------------
MSNO Matrix:
----------------------------------------------------------------------------------------------------
Shape of DataFrame: (442, 11)
----------------------------------------------------------------------------------------------------
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     420 non-null    float64
 3   bp      442 non-null    float64
 4   tc      420 non-null    float64
 5   ldl     442 non-null    float64
 6   hdl     442 non-null    float64
 7   tch     420 non-null    float64
 8   ltg     442 non-null    float64
 9   glu     442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB
----------------------------------------------------------------------------------------------------
Describe DataFrame:
|   | count | mean | median | std | min | 25% | 50% | 75% | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 442 | 48.518 | 50 | 13.109 | 19 | 38.25 | 50 | 59 | 79 | -0.219726 | -0.714041 |
| sex | 442 | 0.532 | 1 | 0.5 | 0 | 0 | 1 | 1 | 1 | -0.085793 | -1.992640 |
| bmi | 420 | 26.358 | 25.7 | 4.405 | 18 | 23.175 | 25.7 | 29.325 | 42.2 | 0.582185 | 0.059985 |
| bp | 442 | 94.647 | 93 | 13.831 | 62 | 84 | 93 | 105 | 133 | 0.271068 | -0.531885 |
| tc | 420 | 188.831 | 186 | 34.691 | 97 | 164 | 186 | 209 | 301 | 0.383226 | 0.226904 |
| ldl | 442 | 115.439 | 113 | 30.413 | 41.6 | 96.05 | 113 | 134.5 | 242.4 | 0.430437 | 0.538215 |
| hdl | 442 | 49.788 | 48 | 12.934 | 22 | 40.25 | 48 | 57.75 | 99 | 0.790610 | 0.987366 |
| tch | 420 | 4.072 | 4 | 1.297 | 2 | 3 | 4 | 5 | 9.09 | 0.737344 | 0.444940 |
| ltg | 442 | 4.641 | 4.62 | 0.522 | 3.258 | 4.277 | 4.62 | 4.997 | 6.107 | 0.300617 | -0.160402 |
| glu | 442 | 91.26 | 91 | 11.496 | 58 | 83.25 | 91 | 98 | 124 | 0.220172 | 0.253283 |
| target | 442 | 152.133 | 140.5 | 77.093 | 25 | 87 | 140.5 | 211.5 | 346 | 0.430462 | -0.876956 |
----------------------------------------------------------------------------------------------------
DataFrame Correlation:
----------------------------------------------------------------------------------------------------
DataFrame Pairplot:
----------------------------------------------------------------------------------------------------
Histogram(s) & Boxplot(s):
----------------------------------------------------------------------------------------------------
Countplot(s):
Classification¶
Import Iris Data Set¶
# Load the data set from sklearn
iris_sklearn = load_iris()
# Extract the data and target labels as a numpy array
iris_npy = np.concatenate([iris_sklearn['data'], np.atleast_2d(iris_sklearn['target']).T], axis=1)
# Define column names
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target']
# Convert the numpy array to a pandas dataframe with column names
iris = pd.DataFrame(iris_npy, columns=col_names)
# Print a description of the dataset
print(iris_sklearn['DESCR'])
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
|details-start|
**References**
|details-split|
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
|details-end|
iris['target'].sample(5)
121    2.0
16     0.0
19     0.0
23     0.0
8      0.0
Name: target, dtype: float64
class_names = dict(zip(list(map(float, range(len(iris_sklearn['target_names'])))), iris_sklearn['target_names']))
print(class_names)
{0.0: 'setosa', 1.0: 'versicolor', 2.0: 'virginica'}
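As an aside, a mapping like this can attach readable labels to rows via Series.map. A small self-contained sketch (the literal dict mirrors class_names above; the series values are illustrative):

```python
import pandas as pd

class_names = {0.0: 'setosa', 1.0: 'versicolor', 2.0: 'virginica'}
codes = pd.Series([0.0, 2.0, 1.0])

# map() looks each numeric code up in the dict, yielding the species name
species = codes.map(class_names)
print(species.tolist())  # ['setosa', 'virginica', 'versicolor']
```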
Performing EDA for classification using fasteda¶
fast_eda(iris, target = 'target')
DataFrame Head:
| sepal_length | sepal_width | petal_length | petal_width | target | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
DataFrame Tail:
| sepal_length | sepal_width | petal_length | petal_width | target | |
|---|---|---|---|---|---|
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2.0 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2.0 |
----------------------------------------------------------------------------------------------------
Missing values:
|   | 0 |
|---|---|
----------------------------------------------------------------------------------------------------
Shape of DataFrame: (150, 5)
----------------------------------------------------------------------------------------------------
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   target        150 non-null    float64
dtypes: float64(5)
memory usage: 6.0 KB
----------------------------------------------------------------------------------------------------
Describe DataFrame:
|   | count | mean | median | std | min | 25% | 50% | 75% | max | skewness | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| sepal_length | 150 | 5.843 | 5.8 | 0.828 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | 0.311753 | -0.573568 |
| sepal_width | 150 | 3.057 | 3 | 0.436 | 2 | 2.8 | 3 | 3.3 | 4.4 | 0.315767 | 0.180976 |
| petal_length | 150 | 3.758 | 4.35 | 1.765 | 1 | 1.6 | 4.35 | 5.1 | 6.9 | -0.272128 | -1.395536 |
| petal_width | 150 | 1.199 | 1.3 | 0.762 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | -0.101934 | -1.336067 |
| target | 150 | 1 | 1 | 0.819 | 0 | 0 | 1 | 2 | 2 | 0.000000 | -1.500000 |
----------------------------------------------------------------------------------------------------
DataFrame Correlation:
----------------------------------------------------------------------------------------------------
DataFrame Pairplot:
----------------------------------------------------------------------------------------------------
Histogram(s) & Boxplot(s):
----------------------------------------------------------------------------------------------------
Countplot(s):
plt.axis('equal')
sns.scatterplot(iris, x='petal_width', y='sepal_width', hue='target', palette=sns.color_palette("hls", iris['target'].nunique()))
plt.show()
# Define a function to format value counts into percentages
def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = int(round(pct * total / 100.0))
        return '{:.1f}%\n({v:d})'.format(pct, v=val)
    return my_format
# Get value counts
vc = iris['target'].value_counts()
# Draw a pie chart using value counts and the `autopct_format` format
_ = plt.pie(vc, labels = vc.rename(class_names).index, autopct=autopct_format(vc))